API: consistent NaN treatment for pyarrow dtypes #61732
Conversation
@mroeschke when convenient I'd like to get your thoughts before getting this working. It looks pretty feasible.
Generally +1 in this direction. Glad to see the changes to make this work are fairly minimal.
+1 as well; this is nice.
Not able to judge the implementation, but I'm +1 on the concept.
dtype, na_value = to_numpy_dtype_inference(
    self, dtype, na_value, hasna, is_pyarrow=False
)
This change means to use object dtype instead of converting NA to NaNs?
We initially did that for the masked arrays' conversion to numpy, but then changed it to use NaNs, because constantly getting object dtype was too annoying (there is some issue discussing this IIRC).
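For context, a sketch of the masked-array conversion being referred to, using the public pandas API (assuming a recent pandas with nullable dtypes installed):

```python
import numpy as np
import pandas as pd

# A nullable (masked) float array holding a missing value.
arr = pd.array([1.0, pd.NA], dtype="Float64")

# Requesting a float64 result converts NA to NaN explicitly,
# avoiding the object-dtype fallback discussed above.
res = arr.to_numpy(dtype="float64", na_value=np.nan)
print(res.dtype)         # float64
print(np.isnan(res[1]))  # True
```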
Yes.
While I am personally in favor of distinguishing NaN and NA, I think most of the changes here involve distinguishing NaN when constructing the arrays (so e.g. constructing the pyarrow-based EA from user input like numpy arrays)? Personally, I think that is a change we should only make after making those dtypes the default, and probably even years after that, after a very long deprecation process.
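To illustrate the distinction under discussion (this is standard pandas/NumPy semantics today, not behavior introduced by this PR):

```python
import numpy as np
import pandas as pd

# NaN is a float value: it compares unequal to itself.
print(np.nan != np.nan)  # True

# pd.NA is a missing-value sentinel: comparisons propagate NA.
print(pd.NA != pd.NA)    # <NA>

# pd.isna treats both as missing, which is why conflating the two
# in constructors happens silently.
print(pd.isna(np.nan), pd.isna(pd.NA))  # True True
```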
Yes. Constructors (which affect read_csv) and
My current thought (will bring up on today's dev call) is that we should add a global flag to enable both never-distinguish (see #61708) as the default and always-distinguish (this) as opt-in.
Based on last week's dev call, I am adapting this and #61708 from POCs to real PRs. This implements a global flag. This PR only implements this for ArrowEA; #61708 will do the same for the numpy-nullables. (I have a branch trying to do it all at once and it is getting ungainly.) A third PR will add tests for the various issues this closes.
This is the third of several POCs stemming from the discussion in #61618 (see #61708, #61716). The main goal is to see how invasive it would be.
Specifically, this changes the behavior of pyarrow floating dtypes to treat NaN as distinct from NA in the constructors and __setitem__ (xref #32265), and also in to_numpy and .values.

Notes:
- This makes the decision to treat NaNs as close-enough to NA when a user explicitly asks for a pyarrow integer dtype. I think this is the right API, but won't check the box until there's a consensus. Changed this following Matt's opinion.
- 1138990 failing tests locally. Most of these are in json, sql, or test_EA_types (which is about csv round-tripping).
- The kludge in NDFrame.where is ugly and fragile. Fixed.